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1.  Introduction 

'The  architectunl  differences  between  n  serial  and  a  parallel  machine  raise  a  number  of 
questions  regarding  the  efficiency  of  established  algorithms'^  tliis  paper  explorei^eTeral 
algorithms  which  solve  the  Poisson  equation  on  rectangular  regions  in  two  dimensions.  The 
solution  of  the  Poisson  problem  is  an  example  of  one  of  ^  simplest  nontrivial  computations 
which  frequently  occur  in  innermost  loops  of  large  scale  scientiHc  codes,  and  hence  is  a  useful 
test  of  different  architectures  for  scientific  computation.  We  eompars-solntion  times  on  the  Vax 
11/780  with  solution  times  on  the  Floating  Point  System  164  (FPS>164)  attached  proeessorV*^ 
Since  the  FPS-164  supports  a  sufficiently  large  memory  and  the  host/attached  processor  I/O  is 
relatively  slow,  it  is  of  interest  to  solve  large  problems  entirely  on  the  FPS>164.  We  explore  the 
performance  of  the  FPS*164  on  both  portable  FORTRAN  programs  which  have  not  been  tuned 
to  its  architecture  and  on  moderately  tuned  FORTRAN  programs  which  make  calls  to  the  FPS 
assembly  languake  math  library,  MATHLIB.  Use  of  MATHLIB  results  in  shorter  programs 
which  are  usually  more  efficient.  We  show  that  the  speedup  in  execution  time  is  more  uniform 
across  the  algorithms  than  might  be  anticipated  and  hence  the  choice  of  algorithm  is  still  highly 
significant. 


In  Section  2  we  outline  three  standard  dgorithms,  the  complete  Fourier  algorithm  based  on 
double  Fourier  transforms,  the  Fourier/Tri^agonal  algorithm  and  the  FACR(l)  algorithm.  In 
Section  3  we  examine  the  tridiagonal  linear  system  solution,  which  is  part  of  the  latter  two 
algorithms,  in  more  detail.  We  analyse  the  classical  Gauss  algorithm  and  examine  a  refinement 
of  the  tridiagonal  linear  system  solution  based  on  ideas  of  Tiialcolm  and  Palmer  [9],  which 
appear  to  be  too  little  known  [12].  In  Section  4,  we  consider  optimisations  of  Ckroley’s  [2] 
method  for  calculating  the  sine  transform  as  a  real  transform  of  half  sise.  We  inesent  timings 
from  our  implementation  of  each  of  the  algorithms  for  the  Poisson  problem  in  Section  5. 
Finally,  we  give  concluding  remarks  in  Section  0. 
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2.  Three  Baaie  Algorithms 

We  begin  by  describing  some  fast  direct  methods  which  are  commonly  nsed  to  solve  the  * 

Poisson  equation 

I 

-  A  u  =  f  (1) 


with  Dirichlet  bonndaiy  conditions  in  two  dimensional  rectangles. 

For  simplicity  in  notation  we  consider  only  the  case  of  the  sqnare.  Define  the  grid  points 
■■  (ih,kh)  fw  i,k  ■■  0,...,n+l  with  h  «  l/(n+l)  and  the  corresponding  function  valnes 
Oji^  mi  n(xj  and  f.^  aw  f(xpy|^).  We  considCT  the  simple  6-point  discrete  Poisson  equation 

-  Ujj  =  fjj.  i.k  =  1 . n.  (2) 


where  the  6-point  Laplacian  operator  is  defined  by 

“it  ■  Jr  t  ^  “i-i.k  ^  “iM.k  ^  “i.k-i  ^  “i.k^i  1 


subject  to  the  boundary  conditions 


(8) 


"Ok 


*  u 


•«I.k 


s  U 


iO 


i.k 


1 . n. 


(4) 


In  Sections  2.1'2.3,  we  describe  three  well  known  fast  direct  methods  for  solving  the  discrete 
Poisson  equation. 


r 


2.1.  Th«  Complete  Foarler  algorithm 

The  approach  of  using  Fast  Fourier  techniques  to  solve  the  above  finite  difference  equations 
origiaatcs  with  Hocknqr  (8).  The  first  method  we  describe  involves  Fourier  transforms  in  each 
variable. 

We  extend  the  sequences  a  m  {u-^}  and  f  m  {fg^}  to  be  odd  doubly  periodic  sequences  of 
period  2(n-kl)  in  both  variables.  TIus  is  valid  since  n  satisfies  the  boundary  valnes  (4)  and  we 
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may  set  ■■  *  fjo  1  ^  ^  eqnation  (2)  does  not  restrict  the  values  of  f  on 

the  boondaiy. 

For  the  5-point  Laplacian  operator,  let  d  as  {d-^}  be  the  doubly  periodic  sequence  of  period 
2(n-fl)  defined  by 

**00  =  *'  **-1.0  =  **1.0  =  **0.-1  =  **0.1  =  **iii  *  ®  Othsruias. 

With  this  notation,  we  may  rewrite  eqnation  (2)  as  the  convolution 

d  •  u  =  h*  f  (5) 

which  can  be  solved  for  a  by  using  discrete  double  Fourier  transforms. 

Let  X  denote  the  discrete  Fourier  transform  of  the  doubly  periodic  sequence  x  with  period 
2(n-(-l )  in  both  variables.  Equation  (5)  then  implies 

d  .  tt  »  h*  f  .  (6) 


The  Fourier  transform  d  of  d  can  earily  be  cdcnlated  to  be 


and  therefore  we  can  solve  for  Uy^: 


“it  *  • 


The  discretisation  of  the  Poisson  problem,  represented  by  equation  (2)  can  now  be  solved  by 
the  following  algorithm: 

A 

1.  Cslcttlats  the  Fourier  transfora  f. 

2.  Calculate  v  by  (7). 

A 

8.  Parforn  the  inwaraa  Fourier  tranaforn  to  recover  n  fron  n. 
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We  remark  that  since  u  is  real  and  odd,  it  sofnces  in  steps  1  and  2  to  nse  a  sine  transform  in 
place  of  a  Fonrier  transform. 

Tlie  complete  Fonrier  algoritiun  also  provides  an  easy  way  to  solve  the  0>pomt  discreUsation 
and  other  more  precise  approximations  to  equation  (2),  see  Henrici  [7].  Far  example,  if  vre  nse 
the  fourth  order  9-poiat  discretisation  for  the  Laplacian 


“i-l.k-1  *  “i-J.k*l  ♦  “i*l,k-l  *  “i*l,kM  1 
and  for  the  right  hand  side  of  equation  (2)  we  nse  the  mean  valnes  g  as  where 

®^ik  *  ^1-l.k  *  ^i*l.k  *  ^i.k-1  *  ^i.k*l  ^  ' 


we  get  an  equation  of  the  same  form  as  (5)  to  solve: 


d  *  a  «  g  . 


In  this  ease,  we  have 


®00  *  20- 


^-1.0  ■  ^1.0  ®  ®0,-l  *  ®0,1  ' 


d.j  .j  *  ^-1  1  ■  -1  ■  dj  I  *  "1.  and  d.^  =  0  otheruiae. 

S.S.  The  Fonrkur/Trldlatonnl  slforlthm 
In  the  literature,  this  algorithm  is  i^ten  rimply  called  the  Fourier  or  basic  Fourier  algorithm, 
see  Temperton  |12].  It  is  based  on  matrix  decomposition,  see  Busbee,  Golub  and  Nielson  {!]. 

The  diseietised  Poisson  equation  (2H3),  can  written  in  matrix  form: 

Hu*  h*  f  an  y  (11) 


"'h.''  >Jv,.  *• . 
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where  the  rector  a  (similarly  f  and  y)  may  be  written 


m  m 

“l 

■  “il  ■ 

n  a 

»2 

and  U|  a 

“i2 

• 

L  u  J 

“i.-* 

and  the  n^xn^  matrix  M  is  block  tridiagon^  of  block  order  n 


I  O 

I  A  I 
I  A  ■ 


where  A  b  the  tridiagonal  nxn  matrix  b  defined  by 


The  eigenralnes  and  eifenrectors  r^  a  {r^}  of  A  are  known  to  be 
s  -  4  >  2  cos  k  =  I . «  . 

*ik  =  i/iiTr  (iJr)'  1 . "  • 

With  a  V  a  ud  A  the  diagonal  matrix  diag(X|^),  we  hare 
VAV  »  A  . 

We  may  rewrite  equation  (11)  w 


(12) 


(13) 


(14) 

(IB) 


(16) 


0 


Au,  ♦  tt2 

ww 

K 

II 

»i-l  *  »i*l 

=  y  -,  • 

".-1  ^ 

=  y.  • 

16)  this  becomes 

vlu,  4 

mmm 

=  ^1  • 

0mm  mmm  0mm 

tt.  .  ^  Au-  *  u.  * 

1-1  1  1*1 

II 

=  y.  • 

i  =  2 . n-1 


i  s  2 . »-l 


where 


“i  =  V«,  . 


•nd 


y*  =  Vy,  . 

If  we  reanwnge  the  equations  (18),  we  get  for  each  k  « 


*  "21  "  • 

Vi»  =  yit  •  *  =  2 . 

Vi.k  =  y.k  • 

Thus  we  get  the  following  algorithm: 

1.  Perforn  the  trensfornetion  (20). 

2.  Solve  the  n  tridiegenel  linear  systena  (21). 
8.  Solve  (19)  for  a,,  i.e.  perforn  the  inverse 

transforn. 


2.8.  The  FACR(1)  algorithm 
FoIIoiriBg  Hockney  [8],  we  start  by  elimiaatins  the  vectors  Uj  with  odd  indices  from  equation 
(17),  i.e.,  consideritts  u  as  a  matrix,  we  eliminate  the  odd  rows.  We  could  also  eliminate  the 
odd  columns  of  u,  f,  see  Temperton  [12]. 

In  fact,  for  indices  i  4,6,...,n-3  , 

«i-2  *  *“i-l  *  “i  =  ' 

«i-l  *  ^  •  ^22) 

“i  *  “i*2  =  • 

and  similar  equations  for  i  2  and  i  >■  n-1.  Multiplying  the  middle  equation  of  22  by  -A  and 
adding  all  three  equations,  we  get 

U..2  -  (2I-A2)ttj  -  Uj.2  =  Xi-I  -  AXi  ♦  Xi., 

and  hence  the  following  system,  where  we  assume  that  n  is  odd.  (For  the  case  where  n  is  even, 
the  last  equation  will  differ.) 

y,  ♦  Xs  ‘  *>2 

Xs  *  Xs  -  AX4 

(23) 

X,-«  ^  X,.2  -  Ay,., 

X.-2  ^  X,  -  Ay,., 

Continiung  this  step  of  eliminating  of  every  other  row  leads  to  the  algorithm  of  cyclic 
reduction.  In  the  literature  this  is  sometimes  refered  to  as  the  FACE  (Fourier  and  Cyclic 
Reduction)  algorithm.  We  stop  after  one  reduction  step  and  use  the  Fourier/Tridiagonal 
algorithm  to  solve  system  (23).  This  is  called  the  FACR(1)  algorithm.  We  replace  matrix  A  of 
Section  2.2  with  the  matrix  21  -  A^.  Since  B  21  •  A^  n  a  polynomial  p  ■■  p(A)  in  A,  it  has 
the  same  eigenvectors  as  A  and  rigenvalues  Xg  «  p(^a)  *  2  •  Xj^^.  The  solution  on  the  odd 
rows  is  obtained  by  solving  the  tridiagonal  system  corresponding  to  the  middle  equation  of  (22) 
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Au,  =  Xj  -  Uj  . 

-  “i-l  "  »i*l  •  *  =  . "•2* 

=  y.  -  “.-1 

The  algorithm  is  summarised  as  follows: 

1.  Set  up  the  right  hand  side  of  (23). 

2.  Solve  (23)  by  the  algoritha  fro*  (2.2). 

3.  Calculate  the  solution  on  the  odd  rows  fros  (24). 
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S.  Refinement  of  the  Trldlngonnl  Llnenr  System  Solution 
Tmo  of  tbe  algorithms  for  solving  the  Poisson  equation  in  reetangnlar  regions  require  solving 
many  tridiagonal  systems  of  linear  equations. 

The  problem  to  be  solved  is  of  the  form 

k  t  =  y  (25) 


where  A  is  an  nxn  tridiagonal  matrix  of  the  form 


A  = 


'  X 
1 

o 


1 

X  1  o 

1  X  1 

I  X 


(26) 


For  all  X  we  have  |X|  >  2.  Therefore  the  system  is  definite  and  we  are  assured  of  the  existence 
of  a  solution. 

Basically  there  are  two  important  elTiciency  issues:  (1)  avoiding  divisions,  and  (2)  keeping  the 
pipelines  filled.  In  Section  3.1  we  discuss  Gauss  elimination  and  in  Section  3.2  we  describe  the 
improvements  of  the  Malcolm  and  Palmer  method  [9]. 


8.1.  The  clamie  Gsnm  algorithm 

Tbe  easiest  way  so  solve  the  system  (25)  is  the  well  known  Gauss  elimination  in  three  steps: 
1.  LR'factorisation:  A  »  LR^  with 


■  1  '■l 

* 

- 

1 

0 

.  hf* 

o 

1  ^-1 

1 

and  R  as 

1 

O 

o 

1 

nst  Botolioii  A  —  LR  in  preference  to  the  ueuni  LU  notation  to  avoid  eonfuwon  with  the  a  from  Section  2. 


From  A  LR  we  get  recursion  formulas  for  the  elements  and  rj^; 


/ 


1 


=  X. 


1  I 

^k*l  '‘k  “  ^ 


2.  Forward  substitution:  Lw  *  y, 


3.  Backward  substitution:  Rx  w 


k  =  1 . n. 


k  =  2 . n. 


K 


A 


*k  =  -k  ■ 


•‘k*k»i 


k  =  n-1 . 1 


This  algorithm  requires  n  multiplications,  2n  divisions  and  3n  additions/subtractions. 


Since  and  rj^  are  reciprocal  values  we  need  only  one  of  them.  To  avoid  as  many  divisions 
as  possible  we  eliminate  /|^.  If  we  define  r^  sw  l/l^  we  get: 


1. 


2.  Wj  =  r,,,  . 


3. 


».  = 


S  =  -  Vi>  • 


*k  =  “k  - 


••kS*! 


k  =  2 . n.  (28) 


k  =  2 . n.  (29) 


k  s  n-1 . 1.  (30) 


This  form  of  the  algorithm  needs  2n  multiplications,  n  divisions,  and  3n 
additions/subtractions. 

In  applications  where  we  have  to  solve  several  Poisson  equations  with  the  same  mesh  site,  we 
could  precompute  the  but  this  would  double  the  amount  of  storage. 
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S.2.  A  more  efneient  algorithm  to  solve  the  trldlagonal  •jrstems 
There  exist  several  algorithms  which  improve  the  storage  requirements  of  the  tridiagonal 
solver.  The  algorithm  of  Evans  [3]  and  Evans  and  Hatsopoulos  [4]  reduces  the  number  of  stored 
elements  rj^  from  n  to  n/2.  Fischer  et  al  [S]  use  a  Fourier-Toeplitt  method  which  does  not  need 
to  store  any  of  the  rj^  but  which  requires  more  operations. 


On  many  machines,  reducing  the  division  count  improves  the  efficiency  of  an  algorithm.  This 
is  particularly  true  on  the  FPS-104,  as  division  is  done  in  software  and  consumes  much  more 
time  than  any  other  basic  operation.  The  relevant  times  are 


APFTN  single  division 
APFTN  vector  division 
APAL  single  division 
APAL  vector  division 


29  cycles 
18  eyeles/elenent 
22  cycles 
7  cycles/elenent 


as  compared  to 

Bultipl icetion  3  cycles 

addition  2  cycles 

Note  that  multiplication  and  addition  operations  may  be  initiated  every  cycle.  Significant 
savings  can  therefore  be  made  by  reducing  the  number  of  divisions. 


Malcolm  and  Palmer  (9]  derive  an  algorithm  which  applies  to  linear  jrstems  with  real, 
^mmetric,  diagonally  dominant,  tridiagonal  coefficient  matrices  with  constant  diagonab.  This 
algorithm  needs  both  fewer  operations  and  much  less  storage.  Malcolm  and  Palmer 
demonstrate  that  the  entries  1^  and  the  r^  of  Section  3.1  converge,  and  give  lower  and  upper 
bounds  on  the  rate  of  convergence,  as  a  function  of  X,  to  a  relative  error  of  c.  We  use  an 
equivalent  upper  bound. 


where  ui  is  defined  by  <«;  m  Sj/sj,  and  where  and  Sj  ^  solutions  of  t  =  1/(X  •  s),  the 
equation  satisfied  by  the  limit  of  rj^,  such  that  |B||  >  {B2I'  We  derive  Icq  in  Appendix  A.  To 
solve  each  tridiagonal  system,  we  compute  from  equation  (31),  perform  only  steps  in  the 
Gaussian  elimination  and  use  Uj^  for  the  rernmning  entries  in  the  factor.  It  is  this  method  that 
we  use  for  the  tridiagonal  solver  in  our  implementation  of  the  Fourier/Tridiagonal  and 
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PACR(1)  metbods  for  aolving  bli«  Poisson  problem. 

If  |X|  is  not  too  close  to  2  we  hive  remarkably  fast  convergence,  and  u  the  exact  index  of 
convergence.  In  Table  3-1,  we  show  the  actnd  index  of  convergence  for  c  ~  IC**  for  varying 
values  of  X.  In  Table  3-1,  k^^^^  denotes  the  index  where  convergence  actually  takes  place,  (see 
Appendix  A),  and  k^  denotes  the  estimate  from  (31). 


1  X  1 

^co»* 

''o 

2.0001 

1445 

1842 

2.0010 

402 

582 

2.0100 

167 

184 

2.1000 

56 

58 

2.5000 

26 

26 

3.0000 

10 

10 

4.0000 

13 

13 

S.OOOO 

11 

11 

6.0000 

10 

10 

Table  S-li  Index  of  convergence  (with  t  -  lOT*^)  in  tridiagonal  solver 

This  gives  a  dramatic  reduction  in  the  number  of  divisions  when  we  solve  Poisson’s  equation 
using  a  nxn  grid.  In  thb  case  we  solve  n  systems  of  form  (25)  with 


X  *  X^  =  -  4  ^  2  cos  . 


k  =  1 . n. 


In  Tabic  3-2,  we  display  a  count  of  the  number  of  divisions  reqwred  for  problems  of  varying 


rise. 


Due  to  the  cost  of  evaluating  (31),  we  start  to  cut  the  amount  of  work  only  when  n  >  31, 
but  tor  larger  n  we  have  eliminated  almost  the  entire  LR-factorisation.  Our  programs  use  the 
improved  tridiagonal  solver  only  for  problems  of  sise  >  31. 

The  effect  of  this  method  on  the  Fourier/Tridiagonal  algorithm  for  the  Poisson  problem  may 
be  seen  by  comparing  the  times  spent  solving  n  linear  systems  in  the  solution  of  the  Poisson 
problem.  Refer  to  the  Fourier/Tridiagonai  algorithm  in  Section  2.2. 

This  method  largely  eliminates  any  advantage  of  preprocessing  in  that  few  faetws  need  be 
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f  divisions 

#  divisions 

n 

in  the  originsi 

using  our  indsi 

% 

sigorithn  =  n^ 

of  convergence 

7 

49 

49 

100.00 

15 

225 

202 

89.78 

31 

061 

560 

58.27 

63 

3969 

1402 

35.32 

127 

16120 

8330 

20.70 

255 

65025 

7733 

11.80 

511 

261121 

17565 

6.78 

1023 

1046529 

80206 

3.75 

2047 

4100200 

86010 

2.07 

Table  3-2:  Division  count  in  the  two  tridiagonal  solvers 

n 

tine  for  the 

tine  for 

s 

original  sigorithn 

new  sigorithn 

31 

0.0088 

0.0001 

103 

63 

0.0388 

0.0244 

72 

127 

0.1331 

0.0704 

53 

265 

0.6271 

0.2196 

42 

511 

2.1036 

0.7414 

35 

Table  3-3 <  Execution  time  (sec)  to  solve  n  tridingonnl  qrstems  in  APFTN 

computed;  hence  the  need  to  store  precomputed  factors  is  eliminated.  In  addition,  the  solntion 
of  the  tridiagonal  systems  u  no  more  expensive  than  the  normalisation  in  the  complete  Fourier 
method,  implying  that  the  Fourier/Tridiagonal  method  b  necessarily  faster  than  the  complete 
Fourier  method,  regardless  of  the  speed  of  the  FFT. 


A  second  major  efTiciency  issue  is  pipelining.  At  cost  in  storage,  «e  explore  the  tradecrff  of 
the  faster  run  time  that  becomes  possible  by  solving  4,  8,  18,  32,  or  n  qrstems  in  parallel, 
performing  forward  solves  for  multiple  systems  followed  by  the  back  solves.  The  results  are 
presented  in  Section  5. 
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4.  The  Fourier  Transform 

In  thu  section,  we  compare  the  efficiency  of  different  Fonrier  Tlnnsfonns  implemented  on  the 
VAX  and  on  the  AP  in  FORTRAN,  and  on  the  AP  nsinf  the  FPS  APMATH  library  routines. 
We  used  compiler  optimisation  level  3  of  the  Revision  D04  APFTN  compiler.  We  took 
advantage  of  the  problem  being  both  real  and  odd  to  implement  the  sine  transform  for  the 
FFT,  and  we  also  present  results  using  the  real  FFT.  The  APMATH  library  includes  a  routine 
for  computing  a  real  FFT,  but  not  a  sine  transformation.  We  implement  our  sine 
transformation  using  the  algorithm  by  Cooley  [2]. 


prob  airs  n 

VAX 

(FORTRAN) 

AP 

(FORTRAN) 

AP 

(LIBRARY  ROI 

128 

0.030 

0.0020 

0.0014 

266 

0.08 

0.0078 

0.0082 

512 

0.10 

0.0150 

0.0066 

1024 

0.30 

0.0312 

0.0147 

2046 

0.87 

0.0640 

0.0202 

4006 

2.28 

0.1510 

0.0645 

8102 

4.68 

0.8186 

0.1885 

16884 

10.8 

0.6561 

0.2018 

Taklw  4>lt  ExecutiM  time  (sec)  for  one  complex  FFT 


prob  size  n 

VAX 

(FORTRAN) 

AP 

(FORTRAN) 

AP 

(LIBRARY  ROUTINES) 

128 

0.031 

0.0028 

0.0000 

256 

0.04 

0.0048 

0.0017 

512 

0.11 

0.0008 

0.0087 

1024 

0.24 

0.0201 

0.0076 

2048 

0.51 

0.0412 

0.0166 

4006 

1.12 

0.0886 

0.0830 

8102 

2.72 

0.1024 

0.0721 

16384 

5.70 

0.8067 

0.1487 

Table  ^St  Execution  time  (sec)  for  one  real  FFT 

^‘11  :b  4'1,  4-2,  and  4*3,  we  present  the  times  for  one  FFT  for  problems  of  one  dimension. 
Note  that  the  real  and  the  complex  FFT  require  arrays  of  length  double  the  problem  siae.  The 
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prob  size  n 

VAX 

(FORTRAN) 

AP 

(FORTRAN) 

AP 

(Cooley) 

128 

0.02 

0.0015 

0.0006 

256 

0.03 

0.0028 

0.0013 

512 

0.05 

0.0054 

0.0021 

1024 

0.13 

0.0121 

0.0046 

2048 

0.25 

0.0244 

0.0001 

4006 

0.56 

0.0407 

0.0201 

8102 

1.22 

0.1022 

0.0400 

16384 

2.05 

0.2275 

0.0882 

Tabla  4-3t  Execution  lime  (see)  for  one  sine  tnnsfonn 

algorithm  for  the  sine  transform  presented  by  Cooley  [2]  iuTolves  pack  and  unpack  operations  in 
addition  to  a  real  transform.  As  the  times  for  onr  sine  transform  are  significantly  higher  than 
the  expected  time  of  an  APAL  version,  in  onr  Poisson  problem  we  Teetwise  the  computation  of 
the  n  sine  transforms  by  packing  and  unpacking  some  number  of  vectors,  m,  in  one  loop.  This 
allows  the  APFTN  compiler  to  take  significant  advantage  of  pipelining.  The  storage  cost  of 
performing  m  sine  transforms  in  parallel  is  nKm/2.  We  save  .0002  seconds  per  transform  in  the 
»  255  by  255  site  problem,  which  reduces  the  number  of  seconds  per  sine  transform  from  .0013  to 

.001.  This  compares  very  respectably  to  the  time  for  a  real  transform  of  .0017  seconds.  In  the 
511  by  511  size  problem,  the  number  of  seconds  per  sine  transform  is  reduced  from  .0021  to 
.0018.  The  time  for  the  real  transform  is  .0037  seconds.  Onr  tables  in  Section  5  refer  to  the 
vectorized  sine  transform  by  the  notation,  Vecsin.  Unless  the  vectorized  sine  method  is 
speciflcally  mentioned,  the  sine  transform  we  implemented  is  the  non>veetorised  implementation 
of  Cooley’s  algorithm. 

Note  that  as  u  shown  in  Table  4-4,  the  complete  Fourier  algorithm  has  the  most  to  gain  from 
an  efficient  sine  transform.  In  section  5  we  show  that  even  with  the  vectorised  sine  transform, 
the  complete  fourier  algorithm  is  not  competitive  with  the  FACR(1)  method. 


Ifi 


Sin*  transf.  of  l•ngth  n 
Trid.  sjrsttm  of  ordtr  n 
Trid.  sjrst*M  of  ord*r  n/2* 

n/2  •ajr  b*  a  i that  n/2  or  n/2  t  1/2 

Table  4-4i  Count  of  components  required  in  each  algoritlun 


Fourlar 

4n 

0 

0 


Fouriar/Trid 

2n 

n 

0 
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6.  Nomerle*!  Results  for  the  Poisson  Problem 

We  present  exeention  times  for  the  two  dimensionsi  Poisson  problem  using  the  slforithms 
described  in  the  prerions  sections.  Each  algorithm  was  applied  to  discretitations  vaiying  in  rise 
from  31  X  31  points  to  511  x  511  points.  Each  algcuithm  was  implemented  in  FORTRAN??  on 
the  VAX  11/780  with  a  FPA,  in  APFTN84  on  the  FPS-164,  and  in  APFTN64  with  extensive 
calk  to  APMATH  library  rontines  from  the  FPS  math  library.  We  used  compiler  optimisation 
level  3  of  the  Revision  D04  APFTN  compiler.  The  FPS-164  snppmts  sufliciently  large  memory 
that  the  problems  run  on  the  FPS-164  were  solved  entirely  on  the  PPS-164. 

One  set  of  tests  was  run  using  the  real  Fast  Fourier  Transform  for  the  FFT.  A  second  set  of 
tests  was  run  substituting  the  sine  transformation  for  the  real  FFT,  cf.  Section  4. 

The  times  apply  to  the  solution  of  the  Poisson  equation  and  do  not  include  the  overhead  costs 
of  the  transfer  of  program  and  data  to  the  AP  or  of  paging  on  the  VAX.  These  are  considered 
separately.  The  VAX  was  run  single-user  with  an  unrealistically  large  working-set  to  eliminate 
dependence  on  working  set  sise. 

B«l.  CompututloB  tlm«f 

Our  results  indicate  that  the  gain  from  the  PPS-164  architecture  is  more  uniform  across 
algwithms  than  might  be  anticipated.  Graphs  for  each  algwithm  are  lacsented  in  Figure  5-1. 
The  ratios  of  timings  on  the  two  machines  for  the  algorithms  using  the  real  FFT  are  presented 
in  Table  5-2. 

The  results  presented  in  Figure  5-1  and  Table  5-1  are  for  FORTRAN  programs  which,  other 
than  for  calk  to  the  MATHLIB,  have  not  been  specially  tuned  for  the  APFTN  compiler.  In  thk 
table,  the  sine  transform  refers  to  the  non-vectorised  implementation  of  Cooley's  algorithm.  In 
the  remainder  of  this  section,  we  examine  several  possible  improvements  to  these  three 
algorithms. 

We  begin  by  presenting  a  breakdown  of  the  percentage  of  time  spent  in  each  secrion  of  riie 
programs.  (See  Table  5-3.) 

To  determine  whether  or  not  the  complete  Fourier  algorithm  is  necessarily  less  efficient  on 
the  parallel  architecture  of  the  AP,  we  fint  vectorised  the  normalisation,  which  reduced  the 


OaiBLE  FOURIER  METHOD  (USING  RERL  TRnNSFORN) 


DOUBLE  FOURIER  METHOD  (USING  SINE  TRANSFORM) 


SO  IDO  ISO  200  2SO  300  SSO  400  450  SOD  SO 
PROBLEM  SIZE  (Z-OlMENSIOMt..  SOURREl 
V  -  VflX  11/780 
F  -  RP  USING  flPFTN64 
M  -  RP  USING  flpmm  RNO  NRTHLIB 

FOURIER/TRIOIRGONRL  METHOD  (USING  REAL  TRRNSFORM) 


SO  100  ISO  200  250  SOD  3S0  400  4S0  SOO  5S0 
PRKLEM  SIZE  tZ-OlMENSlBNRL.  SOURRE) 

V  -  VAX  11/780 

F  -  RP  USING  WF7N64 

M  -  AP  USING  APFTNB4  AND  NATHLIB 

lER/miDIRGONAL  METHOD  (USING  SINE  TRRNSF0IM) 


100  150  200  250  300  SSO  400  450  SOO  S50 
PROBLEM  SIZE  (2-OINENSIONAL.  SOURRE) 

V  -  VRX  11/780 

F  •  RP  USING  RPFTNB4 

M  •  RP  USING  RPFTNB4  RNO  MRTMLIB 

FRCR(l)  fCTHBO  (USING  RERL  TRANSFORM) 


SO  100  ISO  200  250  300  350  4m  450  SSO  SSO 
PROBLEM  SIZE  (2-0I)CNSIBNRL.  SOURRE) 

V  •  VRX  11/780 

F  -  RP  USING  RPfTNB4 

N  -  RP  USING  RPFTN64  RNO  MRTHLIB 


FRCRO) 


(USING  SINE  TRANSFORM) 


SO  100  150  200  2S0  300  350  400  4S0  SOO  550 
PROBLEM  SIZE  (2-01)CNS10NRL.  SOURRE) 

V  -  VRX  11/780 
F  -  AP  USING  APFTN84 

M  -  AP  USING  APrnm  and  nrtmlib 


SO  100  ISO  200  250  300  350  400  450  500 
PROBLEM  SIZE  (2-OIMENSlONRL.  SQUARE) 

V  -  VRX  11/700 

F  -  RP  USING  RPFTNB4 

N  -  AP  USING  RPFTN84  AM)  NRTM.IB 


FIfvfw  S-lf  ExeeotioB  tiaos  (nc)  to  Aohre  Powoi’a  «q«atioa 


Poisson  ProbUs 

Using  RssI  Trsnsfors  Using  Sins  Trsnsfors 

FFT  FFT/Trid  FACR(l)  FFT  FFT/Trid  FACR(l) 


VAX  (FORTRAN) 

0.62 

0.36 

0.25 

0.44 

0.28 

0.20 

31x31 

AP  (FORTRAN) 

0.10 

0.05 

0.03 

0.08 

0.04 

0.03 

AP  (NATHLIB) 

0.04 

0.03 

0.02 

0.04 

0.03 

0.02 

VAX  (FORTRAN) 

2.61 

1.40 

0.00 

1.64 

1.00 

0.73 

63x63 

AP  (FORTRAN) 

0.34 

0.10 

0.12 

0.24 

0.14 

0.10 

AP  (NATHLIB) 

0.16 

0.10 

0.08 

0.18 

0.10 

0.08 

VAX  (FORTRAN) 

10.30 

5.00 

8.03 

6.72 

4.12 

3.00 

127x127 

AP  (FORTRAN) 

1.21 

0.67 

0.43 

0.87 

0.50 

0.35 

AP  (NATHLIB) 

0.62 

0.37 

0.27 

0.46 

0.20 

0.24 

VAX  (FORTRAN) 

50.33 

27.05 

17.78 

26.86 

16.47 

14.5 

255x255 

AP  (FORTRAN) 

5.30 

2.85 

1.76 

3.20 

1.70 

1.23 

AP  (NATHLIB) 

2.42 

1.30 

1.00 

1.77 

1.07 

0.85 

VAX  (FORTRAN) 

206.0 

116.74 

74.64 

123.07 

73.60 

58.16 

511x511 

AP  (FORTRAN) 

21.31 

11.26 

6.80 

13.64 

7.43 

4.00 

AP  (NATHLIB) 

10.22 

5.65 

3.07 

6.76 

3.03 

3.10 

Toble  S-1:  Execution  times  (sec)  to  soke  Poisson’s  eqnstion 


A  Igor i tbs  VAX/AP(FTN}  AP(FTN}/AP(NATHLIB)  VAX/AP(NATHL1B) 


Fourier 

10.50 

2.08 

22.08 

Fourier/Trid 

11.25 

1.00 

22.54 

FACR(l) 

11.84 

1.74 

20.56 

Table  6-Si  Ratios  of  times  (sec)  on  difloent  architectures  for  each  algorithm  using  real  FFT 

percentage  of  time  spent  in  the  normaiitstion  from  17%  to  8%.  (See  Table  6-4.)  For  the  255  x 
255  sise  problem,  .10  seconds  was  sarcd  bj  the  vectorised  dkide  in  a  nermalisatioa  that 
otherwise  took  .28  seconds,  for  a  67%  reduction  in  normalisation  time,  and  a  10%  savinp  in 
the  entire  Poisoon  proUem. 

As  was  diaeaased  in  Section  4,  the  complete  Fonricr  algorithm  has  the  most  to  gain  horn  an 
efficient  sine  transform.  We  implemented  the  Teetorised  sine  transform  from  Section  4  for  each 
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of  our  algorithms.  These  results  are  given  in  Table  5-5. 


SliSl 

127al27 

511i511 

Preparation 

0.5 

0.1 

0.0 

Sine  FFT  in  X 

21.9 

20.9 

20.7 

Fourier  Sine  FFT  in  Y 

22.1 

21.0 

20.7 

Noras  1 izstion 

11.5 

16.0 

17.1 

Back  FFT  in  Y 

22.1 

21.0 

20.7 

Back  FFT  in  X 

21.9 

20.9 

20.7 

Preparation 

0.5 

0.1 

0.0 

Sine  FFT 

33.0 

28.5 

27.5 

FFT/Trid  Trid  S^sten 

30.4 

39.8 

41.9 

(Std  Trid)  Back  FFT 

38.0 

28.5 

27.5 

Noras  t izstion 

8.1 

3.1 

8.0 

Preparation 

0.5 

0.1 

0.0 

Sine  FFT 

82.6 

35.1 

37.8 

FFT/Trid  Trid  S^stea 

31.1 

25.9 

20.3 

(Nev  Trid)  Back  FFT 

32.6 

35.1 

37.8 

Noras  1 izstion 

8.1 

3.9 

4.2 

Preparation 

0.4 

0.1 

0.0 

Odd/Eren  Redue 

8.0* 

10.6 

12.2 

Sine  FFT 

18.8 

20.4 

22.6 

FACR(l)  Trid  Spstea 

26.2 

22.7 

17.5 

(New  Trid)  Back  FFT 

18.8 

20.4 

22.6 

Noraal ization 

5.4 

6.1 

6.6 

Solve  Odd  Rows 

22.4 

19.7 

18.5 

TnbI*  5-St  %  time  spent  in  each  algorithm  using  sine  transform,  MATHLIB 


Poisson  Problen 


Using  Real 

Transfera 

Using  Sine  Transfera 

Fourier  Fowrier(Vee) 

Fourier 

Fourier(Vee) 

81i81 

.04 

.04 

.04 

.04 

68168 

.16 

.16 

.18 

.18 

127il27 

.62 

.56 

.46 

.48 

2551255 

2.42 

2.27 

1.78 

1.61 

5111511 

10.22 

9.58 

6.76 

6.11 

Tnblo  Times  times  (see)  using  Fourier  method  with  Veedhr  and  MATHLIB 
The  Tsetorised  sine  transform  does  result  in  a  notkahle  improvement  in  compute  time 
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Poisson  ProbiM 


Fourier 

Fourier 

Fourier 

Fourier 

FACR(l)  FACR(l) 

F/Trid  F/Trid 

Vsediv 

Vsesin 

Vsediv 

Vsesin 

Vsesin 

Vsesin 

31x31 

.04 

.04 

.08 

.03 

.02 

.02 

.03 

.02 

63x63 

.13 

.13 

.10 

.00 

.08 

.07 

.10 

.08 

127x127 

.46 

.43 

.37 

.83 

.24 

.21 

.20 

.25 

255x255 

1.78 

1.61 

1.46 

1.30 

.85 

.77 

1.07 

.05 

511x511 

6.76 

6.11 

n/s 

n/s 

3.10 

2.87 

3.03 

3.52 

Tftble  5-5:  Times  (sec)  using  complete  Fourier  method  with  optimitstions  and  MATHLIB 

although  it  is  more  storage  intensive.  Note  however  that  the  PACR(1)  method  still  provides  a 
significant  savings  in  time  over  the  complete  Fourier  method. 

To  cut  some  of  the  added  storage  costs,  we  explored  the  method  of  solving  m  systems  instead 
of  all  n  systems  in  the  FACR(1)  algorithm  in  parallel,  f»  m  «  15,  32,  and  54.  We  noted  that 
for  loops  of  site  less  than  32,  the  efltciency  of  the  APMATH  routines  over  APFTN  optimisation 
level  3  is  overshadowed  by  the  overhead  of  the  subroutine  calls.  We  present  times  for  the 
APFTN-vectorited  sine  transform  in  Table  S-5. 


bloek=16 

Poisson  Probisn 
bloek«32 

bloek=64 

31x31 

.02 

.02 

.02 

68x63 

.08 

.07 

.07 

127x127 

.23 

.22 

.22 

255x255 

.81 

.70 

.77 

511x511 

2.05 

2.87 

n/s 

Tnbln  6-6:  Times  (sec),  blocked-vectorited  sine  transform,  FACR(1)  method,  MATHLIB 

An  additional  optimisation  which  b  impcHtant  in  the  case  where  multiple  problems  are  being 
solved  b  the  preprocessing  of  the  normalisarion  in  the  complete  Fourier  algorithm.  Our 
pfcproceiring  consists  of  storing  the  reciprocal  of  the  normalisation  constants,  enabling 
normalisation  to  be  done  by  a  vector  muhipiy.  We  present  times  for  the  preiHweessed  complete 
Fourier  algorithm  in  Table  5>7. 

We  next  consider  the  Fourier/Tridiagonal  algorithm.  We  try  to  enable  the  maximum 
amount  of  pipelining  in  the  tridiagonai  solver  by  performing  the  forward  substitutions  for  m 
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Poisson  Probira 


prtproetssing 


tins 

31i31 

.00 

53i68 

.01 

1271127 

.03 

255i265 

.12 

Fou r i s r- vscfflu I - vtes i n 
sieluding  prsproesssing 
.03 
.00 
.34 
1.22 


Tsbis  5-7:  Execution  times  (sec)  using  Fourier  method  with  preprocessing 

pjrstems  in  psrsllel  followed  bjr  the  bsckwnrd  substitutions  in  the  solntion  of  the  tridisgonal 
linear  systems,  cf.  Section  3.2.  In  Table  5-8,  we  imsent  times  for  problems  of  sites  as  large  as 
255  X  255  using  the  real  FPT.  Though  the  method  becomes  more  storage  intensive  as  m  grows, 
we  found  that  our  times  did  not  improve  unless  m  was  >  31.  Note  that  the  improvement  in 
the  FACR(1)  algorithm  b  necessarily  leas  since  only  n/2  tridiagonal  linear  systems  are  solved. 


Poisson  Probisn  Using  Real  Trsnsfornstion 


Fourisr/Trid 

Fourlsr/Vse-Trid 

FACR(l) 

FACR(1)/V#c 

Sli81 

.03 

.03 

.02 

.02 

63i63 

.10 

.09 

.08 

.08 

127S127 

.37 

.36 

.27 

.26 

255I2S5 

1.39 

1.32 

1.00 

.98 

Table  5-8:  Times  (secs)  to  solve  Poissoci’s  equation.  Vectorising  the  tridiagonal  solves 

6.S.  Compiler  Efflelenep 

Finally  we  examine  the  capability  of  the  architecture  of  the  FPS-184  as  demonstrated  by  nse 
of  tbe  MATHLIB  routines  and  the  extent  to  which  the  existing  compiler  utilises  it.  The  times 
recorded  by  the  APFTN  programs  vary  significantly  depending  on  both  the  level  of 
optimisation  of  the  FPS  compiler  and  the  release.  Optimisation  levd  3  of  Version  D04  of  the 
APFTN54  compiler  was  used  in  all  the  testa  of  thb  paper.  Previous  runs  had  been  made  using 
optimisation  level  2  (due  to  errors  at  opt  level  3)  of  the  Versioa  C  compilCT,  also  at  optimisation 
level  2  of  the  D04  compiler.  We  present  these  times  for  comparison  in  Table  5-9. 

Note  that  the  architectare  of  the  FPS-104  offers  a  degree  of  parallelism  that  b  not  fully 
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Fourier/Trid 

using  Sins  Transfora,  APFTN 

0pt2,RevC 

0pt2.RsvD04 

0pt3,RsvD04 

31x31 

.10 

.05 

.04 

63x63 

.33 

.19 

.14 

127x127 

1.16 

.67 

.50 

255x255 

4.39 

2.44 

1.07 

511x511 

16.29 

10.01 

5.0 

FACR(l)  using  Sins  Transfora,  APFTN 

0pt2.RavC 

0pt2,RsvD04 

0pt3,RovD04 

31x31 

.04 

.04 

.03 

63x63 

.15 

.13 

.10 

127x127 

.53 

.48 

.35 

255x255 

2.00 

1.89 

1.23 

511x511 

8.3 

srror 

4.99 

FACR(l) 

using  Sins  Transfora,  NATHLIB 

0pt2.RovC 

Opt2,Rsv004 

0pt3,Rovl)04 

31x31 

.03 

.03 

.02 

63x63 

.10 

.09 

.08 

127x127 

.31 

.28 

.24 

255x255 

1.10 

1.05 

.85 

511x511 

4.03 

3.94 

3.10 

Table  6>9:  Execution  times  (sec)  for  Version  C  and  Version  D  APFTN  compiler 

utilized  bj  any  version  of  the  compiler.  The  sections  of  the  program  which  make  use  of  the 
APAL  math  library  improve  the  running  time  of  the  entire  Poisson  problem  by  factors  of  as 
much  as  2,  depending  on  the  algorithm.  We  are  expecting  research  in  radical  compiler 
techniques  at  Yale  to  provide  a  FORTRAN  compiler  that  will  reduce  these  factors  [0]. 


We  conclude  by  comparing  the  best  times  we  were  able  to  calculate  for  the  FPS>164  against 
the  published  results  obtained  for  the  CDC  7600  by  Swarztranber  [11].  We  graph  the  time  to 
solve  the  Poisson  equation  for  each  of  our  three  algorithms,  making  use  of  MATHLIB  calls  and 
vectorization.  These  are  presented  in  Figure  5-2. 
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PReSLEH  SIZE  (2-0IMEriSr2Na..  SQtJRRE) 

D  -  OaUBLE  F0URIER  (VECOIV. VECSINl 
T  -  F0UR1ER/TRID  tVECSlN) 

R  -  FflCRtll  (VECSIN) 

1  -  7600  RESULTS  -  FBURIER/TRIO 

2  -  760)  RESULTS  -  FflCRd) 


Figure  5*2i  Best  times  for  the  soiutiou  of  the  Poisson  problem  on  the  AP 


5.9.  Overhead  in  using  the  AP 

We  hsTc  found  the  time  required  to  transfer  the  dstn  over  the  Unibns  connecting  the  AP  nnd 
the  VAX  to  be  significant.  For  the  31x31,  127x127  and  511x511  site  problems,  the  entire 
oTerhead  including  data  transfer  has  been  .8,  1.1,  and  5.9  seconds.  Our  real  throughput  time  to 
transfer  the  data  over  the  Unibns  has  been  less  than  one  half  megabyte  per  second. 


The  FPS-184  supports  a  D04  subsystem  which  is  directly  connected  to  the  FPS.184  I/O  bus. 
This  has  a  maximum  1.2  megabyte/second  transfer  rate,  which  is  comparable  to  the  Unibus 
transfer  rate.  One  problem  of  future  interest  is  the  out^f-core  Poisson  problem.  See 
Schuhs  [10]  for  an  algorithm  analysis  of  this  problem.  We  expect  future  results  which  make  use 
of  a  bulk  memory  system  which  we  have  interfaced  to  the  PPS>104  I/O  bus  which  adueres  a  41 
megabyte/second  transfer  rate. 
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6.  Conclusions 

Some  of  the  conclusions  that  should  be  noted  are  the  following: 

•  The  Malcolm  and  Palmer  technique  for  the  solution  of  the  specialised  tridiagonal 
system  of  linear  equations  which  results  from  the  Poisson  problem  is  very  eflicient. 

•  Efficient  implementation  of  transforms  lessen  the  difference  between  the  complete 
Fourier,  Fourier/Tridiagonal  and  FACR(1)  methods  on  the  FPS'154.  However,  the 
most  efficient  method  for  the  FPS-104  architecture  is  still  the  FACR(1)  method^. 

•  Except  for  very  small  problems,  in  each  of  the  algorithms,  utilization  of  the 
MATHLIB  improves  the  execution  time  over  FORTRAN  code  of  up  to  a  factor  of  2. 

•  The  overhead  of  data  throughput  is  significant  in  proportion  to  the  compote  time  on 
the  FPS-164. 

•  Optimized  code  for  the  FPS>164  has  been  shown  to  be  more  efficient  than  standard 
code  for  the  same  algorithm  on  the  CDC  7000. 
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Appendix  A 

Upper  bound  for  Convergent  Tridiagonal  Solver 

In  Section  3,  we  outlined  the  fast  tridiagonal  solver  presented  bj  Malcolm  and  Palmer  [0]. 
They  present  upper  and  lower  bounds  of  convergence  of  the  r^  of  equation  (28).  In  this 
Appendix,  we  solve  the  nonlinear  difference  equation  (28)  explicitly  and  subsequently  arrive  at 
an  eqiiivalent  upper  bound  of  convergence  to  a  predefined  accuracy. 

Let  2|  and  ^  solutions  of  s  =  1/(X  •  t),  the  equation  satisfied  by  the  limit  of  r^,  such 
that  |i||  >  Itjl'  Let  ut  :=  Xj/'r  X  <•  2,  as  in  the  Poisson  equation,  we  have  X  <  tj  <  >2 
<  0  and  therefore  0  <  a;  <  1. 

Clearly 


X  -  ^/x2  -  4  X  ♦  Vx*  -  4 


X  =  Zj  ♦  Zj  =  Zj  (I  ♦  w)  . 
and 


It  can  be  shown  by  induction  that 


1  - 

'  z,(l  - 

is  the  general  solution  of  the  recurrence  formula  (28).  This  satisfies  rj 
induction  hypothesis,  it  follows  that: 


(32) 

1/X.  Given  the 


I  _  1  - 

'  X^  '  X(1  -  -  z;*(l  - 

1 

Zj(Xzj*  -  Xzj*«‘  " 


Note  that  — *  1  /S|  =  >2  "  monotose  for  k  — »  00. 


It  is  easy  to  determine  the  index  k^^^  such  that  rj^  has  conTerged  with  maximum  relative 

CMV 

eiTor  c.  We  determine  a  simple  upper  bound,  Icq. 


«*(!  -  <j) 

^  (1  - 

<  <  <  • 


Therefore 


This  expression  for  k^  is  equivalent  to  the  upper  bound  given  by  Malcolm  and  f'almer. 
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